
Now let’s go platform by platform:
Most WordPress sites generate a default robots.txt file. But it often needs customization.
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Don’t block /wp-content/ (it contains the CSS/JS needed for rendering).
Use SEO plugins like Yoast SEO or Rank Math to edit robots.txt directly.
Always include your sitemap.
Blogger provides an option for Custom Robots.txt under Settings → Crawlers & Indexing.
User-agent: *
Disallow: /search
Allow: /
Sitemap: https://yourblog.blogspot.com/sitemap.xml
Why block /search? Because Blogger automatically creates duplicate URLs like:
https://yourblog.blogspot.com/search/label/SEO
Blocking /search prevents wasted crawl budget and duplicate indexing.
Shopify auto-generates robots.txt, but you can now edit it.
User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /orders
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
Block cart/checkout/order pages.
Keep product and category pages open.
Other website builders also generate robots.txt automatically, and you can usually edit the file in Site Settings.
Ensure duplicate pages, filter URLs, and backend areas are blocked.
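As a hedged starting point for builder platforms (the /account/ path and the filter parameter below are placeholders; match them to the backend and filter URLs your builder actually generates):
User-agent: *
# Placeholder paths: adjust to your builder's backend and filter URLs
Disallow: /account/
Disallow: /*?filter=
Sitemap: https://example.com/sitemap.xml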
If you’re using a custom-built site, upload robots.txt manually via FTP or cPanel.
Example template:
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /tmp/
Allow: /
Sitemap: https://example.com/sitemap.xml
Quick “Do / Don’t” recap
Do
Put robots.txt in the root of every host/subdomain you control.
Use * and $ deliberately for precise matching (see the example after this recap).
Use paired rules for tricky params (?id= and &id=).
Prefer meta/X-Robots-Tag noindex for removal from search results.
Don’t
Put secrets in robots.txt (it advertises them).
Expect bad crawlers to obey robots.txt.
Forget User-agent or the leading / in paths.
Try to control other subdomains from one robots.txt.
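To make the wildcard and paired-parameter advice concrete, here is a minimal sketch (the .pdf extension and the id= parameter are placeholders; swap in whatever your site actually uses):
User-agent: *
# $ anchors the match to the end of the URL, so only PDF files are blocked
Disallow: /*.pdf$
# Paired rules catch the id parameter whether it appears first (?id=) or later (&id=) in the query string
Disallow: /*?id=
Disallow: /*&id=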
Finally, test your file in a robots.txt validator and testing tool before publishing it.
Frequently Asked Questions (FAQs)
1. What is a robots.txt file?
Robots.txt is a simple text file placed in the root directory of a website. It gives instructions to search engine crawlers about which pages or sections of the site they can or cannot crawl.
2. Where should I place the robots.txt file?
The robots.txt file must be placed in the root directory of your domain. Example:
✅ https://example.com/robots.txt
❌ https://example.com/folder/robots.txt
3. Does robots.txt block a page from Google completely?
No. Robots.txt prevents crawling but not indexing. If a blocked page is linked from elsewhere, Google may still index its URL (without content). For full control, use the noindex meta tag or HTTP headers.
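For reference, the two standard ways to apply noindex are a meta tag in the page’s <head> or an HTTP response header:
<meta name="robots" content="noindex">
X-Robots-Tag: noindex
Either way, the page must stay crawlable (not blocked in robots.txt) for Google to see the signal.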
4. Can robots.txt hide sensitive data?
No. Robots.txt is public, so anyone can view it. To protect sensitive information (like admin or customer data), use password protection, firewalls, or server-side restrictions.
5. What happens if I block CSS and JavaScript in robots.txt?
Blocking CSS/JS prevents Google from rendering your site properly. This can harm rankings since Google evaluates the full user experience. Always allow CSS and JS files.
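If you do need to disallow a directory that also holds assets, a pattern like this keeps stylesheets and scripts crawlable (the /includes/ directory is just an example; use your own path):
User-agent: *
Disallow: /includes/
# The longer, more specific Allow rules take precedence, so CSS and JS inside the directory stay crawlable
Allow: /includes/*.css$
Allow: /includes/*.js$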
6. Is robots.txt necessary for every website?
Not always. Small websites with only a few pages can work fine without it. However, for blogs, eCommerce sites, or large platforms with many URLs, robots.txt is highly recommended to manage crawl budget efficiently.
7. How do I test my robots.txt file?
You can test it using Google Search Console’s robots.txt Tester. It allows you to check whether specific pages are being blocked or allowed for Googlebot.
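If you also want a quick local sanity check, Python’s built-in urllib.robotparser can evaluate basic rules. Note that it follows the original robots.txt specification and ignores Google’s * and $ wildcard extensions, so treat it as a rough check only (the rules and URLs below are illustrative):
from urllib.robotparser import RobotFileParser

# Paste your rules here, or call rp.set_url(...) and rp.read() to fetch the live file
rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/wp-admin/options.php"))  # False: blocked
print(rp.can_fetch("*", "https://example.com/blog/my-post/"))         # True: allowed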
8. What is the difference between robots.txt and meta robots tag?
Robots.txt → Controls crawling (which pages bots can visit).
Meta robots tag → Controls indexing (whether a page should appear in search results).
Best practice is to use both together for maximum control.
9. Can I use robots.txt for specific search engines only?
Yes. You can set rules for specific crawlers by mentioning their user-agent. Example:
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /test/
10. What are common mistakes in robots.txt?
Accidentally blocking the entire site with Disallow: /
Blocking CSS/JS files.
Using it to hide sensitive data.
Forgetting to add sitemap reference.
Having conflicting rules that confuse crawlers.
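The first mistake is worth spelling out, because the two files below differ by a single character:
# Blocks the ENTIRE site for all crawlers (often a leftover from staging)
User-agent: *
Disallow: /

# Allows everything: an empty Disallow means no restrictions
User-agent: *
Disallow: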